Accumulating Transformations for Hierarchical Linear Regression HMM Adaptation

Field of Invention

This invention relates to speech recognition and more particularly to adaptive speech 
recognition with hierarchical linear regression Hidden Markov Model (HMM) adaptation. 
Background of Invention 

Hierarchical Linear Regression (HLR) (e.g. MLLR [See C. J. Leggetter and P. C.
Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density
HMMs," Computer, Speech and Language, 9(2):171-185, 1995]) is now a common technique to
transform Hidden Markov Models (HMMs) for use in an acoustic environment different
from the one in which the models were initially trained. The environment covers speaker accent,
speaker vocal tract, background noise, recording device, transmission channel, etc. HLR
improves word error rate (WER) substantially by reducing the mismatch between training and
testing environments [See C. J. Leggetter cited above].

Hierarchical Linear Regression (HLR) (e.g. MLLR, Maximum Likelihood Linear
Regression) is a process that creates a set of transforms that can be used to adapt any
subset of an initial set of Hidden Markov Models (HMMs) to a
new acoustic environment. We refer to the new environment as the "target environment",
and the adapted subset of HMM models as the "target models". The HLR adaptation process 
requires that some adaptation speech data from the new environment be collected, and converted 
into sequences of frames of vectors of speech parameters using well-known techniques. For 
example, to create a set of transforms to adapt an initial set of speaker-independent HMMs to a 
particular speaker who is using a particular microphone, adaptation speech data must be 
collected from the speaker and microphone, and then converted into frames of parameter vectors, 
such as the well-known cepstral vectors. 

There are two well known HLR methods for creating a set of transforms. In the first 
method, the adaptation speech data is aligned to states of the initial set of HMM models using 
well-known HMM Viterbi recognition alignment methods. A regression tree is formed which 
defines a hierarchical mapping from states of the initial HMM model set to linear transforms. 
Then the set of linear transforms is determined that adapts the initial HMM set so as to increase 
the likelihood of the adaptation speech data. While this method results in better speech 
recognition performance, further improvement is possible. The second method uses the fact that 
transforming the initial HMM model set by the first set of linear transforms yields a second set of 
HMMs. This second set of HMMs can be used to generate a new alignment of the adaptation 
speech data to the second set of HMMs. Then it is possible to repeat the process of determining 
a set of linear transforms that further adapts the second HMM set so as to increase the likelihood 
of the adaptation data. This process can be repeated iteratively to continue improving the 
likelihoods. However, this requires that after each iteration either a new complete set of HMMs 
is stored, or that each new set of linear transforms is stored so that the new HMM set can be 
iteratively derived from the initial HMM set. This can be prohibitive in terms of memory storage 
resources. The subject of this invention is a novel implementation of the second method such 
that only the initial HMM set and a single set of linear transforms must be stored, while 
maintaining exactly the performance improvement of the second method and reducing the
processing required. This is important in applications where memory and processing time are
critical and limited resources. 

Typically, the iteration requires M alignments of the transformed HMMs against the speech data,
and the results of each alignment are used to produce new transformations through N EM
re-estimations. Thus M x N steps are required.

Current methods build the models at the m-th step from the models at the (m-1)-th step. Each step
produces, by EM (Expectation Maximization), a set of transformations that is used by the next step,
as illustrated in Figure 1. At recognition time, two alternatives can be considered for obtaining
the target models. The first is to store the model set obtained after the M x N transformations. As
typical continuous speech recognizers may use tens of thousands of mean vectors, storing
additional parameters of that size is unaffordable in situations such as speech recognition on
mobile devices. The second is to apply the M x N transformations successively to the initial
model set, as illustrated by Figure 2. This requires storing the M x N transformations. Typically the
storage requirement is substantially lower. However, it is still prohibitive for typical embedded
systems such as DSP-based ones. Notice that, as represented by the size of the boxes in Figure 2,
the number of transformations in each transformation step may be different.
Summary of Invention 

A new method is introduced which builds the set of HLR adapted
HMM models at any iteration directly from the initial set of HMMs and a single set of linear
transforms in order to minimize storage. Further, the method introduces a procedure that merges
the multiple sets of linear transforms from each iteration into a single set of transforms while
guaranteeing that the performance is identical to that of the present-art iterative
methods. The goal of the method to be described is to provide a
single set of linear transforms which combines all of the prior sets of
transformations, so that a target model subset at any iteration can be calculated directly from the
initial model set and the single set of transforms. Figure 3 illustrates the goal.

The combination guarantees the exactness of the total transformation, i.e. the resulting
models obtained by the single set of transformations are the same as the target models obtained
by successive applications of transformations. This result makes it possible for recognizers on
mobile devices to have adaptation capability.

Description of Drawings: 

Figure 1 illustrates the present-art HLR methods, where the final target
models are obtained by successive application of several sets of transformations T1, T2, etc.

Figure 2 illustrates present-art target models obtained by successive application of
several sets of transforms having different hierarchical mappings of HMM
models to transforms.

Figure 3 illustrates new target models obtained by a single application of one set of
transforms T and one set of hierarchical mappings of HMM models to transforms.

Figure 4 illustrates part of a regression tree which maps HMM models to transforms.

Figure 5 illustrates the operation of multiple iterations of Estimate-Maximize (EM) 
adaptation according to one embodiment of the present invention. 

Figure 6 illustrates the system according to one embodiment of the present invention. 
Description of Preferred Embodiment 

In accordance with the present invention, as
illustrated in Figures 3 and 5, the disclosed method, using multiple iterations of the
well-known Estimate-Maximize (EM) algorithm, builds a single set of linear transforms that can
transform any subset of the initial HMM model set to adapt it to a new environment. This is
accomplished by a novel method which combines multiple sets of linear transforms into a single
transform set. Fig. 1 illustrates the process of creating sets of linear transforms according to
present art. The process begins with an initial model set M0 and speech data collected in the
new environment. In addition, the process starts with a hierarchical regression tree, of which a
portion is illustrated in Fig. 4. In the preferred embodiment, the hierarchical regression tree is
used to map initial monophone HMM models to linear transforms. While in the preferred
embodiment the mapping is from monophone HMM models to linear transforms, it should be
understood that the mapping could be from any component of an HMM model, such as a
probability density function or cluster of distributions. The hierarchical regression tree is used
during creation of the set of linear transforms to determine how many linear transforms will exist,
and what data is used to generate each linear transform. This will be described in detail below.

As can be seen in Fig. 1, the process of creating linear transforms is iterative. At the start 
of the process, the adaptation speech data is aligned with the initial model set M0 using
well-known Viterbi HMM speech recognition procedures. This results in a mapping defining which
portions of the adaptation speech data correspond to monophone models of the initial HMM set. 
It is possible that the adaptation speech data does not contain any instance of some monophones. 
It is still desirable to create linear transforms that can be used to transform even those 
monophones for which there is little or no adaptation data. This is the purpose of the hierarchical 
regression tree. Once the alignment between adaptation speech and monophone HMMs is 
performed, a count of number of adaptation speech frame occurrences mapping to each 
monophone in the adaptation data is made. A cumulative sum of the number of occurrences of 
monophones under each node of the regression tree is made. A linear transform will be 
constructed for each monophone HMM or group of monophone HMMs such that the cumulative 
sum at the lowest node connected to the monophone is at least as large as a threshold value. For 
example, consider the UW, UH, and AX monophones in the regression tree of Fig. 4. Suppose
the threshold value is set to 100, and that there are 100 instances of the adaptation frames
mapping to monophone AX in the adaptation data, 2 instances mapping to the UW monophone, and
1 instance mapping to the UH monophone. According to the regression tree of Fig. 4, a linear 
transform will be created for the AX monophone itself since there are 100 instances mapping to 
AX in the adaptation data. There are not enough instances mapping to UW or UH to create a 
unique transform for each of these monophones. Continuing up the regression tree from UW and 
UH, the cumulative sum is 3 instances. This is still not greater than the threshold. Continuing 
further up the regression tree, the cumulative sum for UW, UH and AX is 103, which is larger 
than the threshold value, so the adaptation data for the UW, UH, and AX monophones will be
combined to form 103 instances that will be used to form a linear transform that will be used to
adapt both the UW and UH monophones.
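
The worked example above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the internal node names ("UW_UH", "UW_UH_AX") and the exact shape of the tree fragment are hypothetical, inferred from the text rather than taken from Fig. 4, and the function names are invented for this example.

```python
# Hypothetical fragment of the regression tree of Fig. 4, stored as child -> parent.
# The internal node names are invented for illustration only.
PARENT = {"UW": "UW_UH", "UH": "UW_UH", "UW_UH": "UW_UH_AX", "AX": "UW_UH_AX"}


def cumulative_counts(leaf_counts, parent):
    """Sum the adaptation frame counts of all leaves found under each node."""
    totals = dict(leaf_counts)
    for leaf, count in leaf_counts.items():
        node = leaf
        while node in parent:
            node = parent[node]
            totals[node] = totals.get(node, 0) + count
    return totals


def transform_node(leaf, totals, parent, threshold):
    """Return the lowest node at or above `leaf` whose count reaches the threshold."""
    node = leaf
    while totals[node] < threshold:
        if node not in parent:
            return None  # even the top of this fragment has too little data
        node = parent[node]
    return node


# Frame counts from the example in the text; threshold value of 100.
leaf_counts = {"AX": 100, "UW": 2, "UH": 1}
totals = cumulative_counts(leaf_counts, PARENT)
assignment = {leaf: transform_node(leaf, totals, PARENT, threshold=100)
              for leaf in leaf_counts}
print(totals)
print(assignment)
```

With these counts, AX keeps a transform of its own, while UW and UH both fall back to the node that pools the data of all three monophones.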

Referring again to Fig. 1, the aligned adaptation data is used in a well-known Expectation
Maximization (EM) algorithm to calculate a maximum likelihood estimate of the parameters of the
linear transform set T1. The set of transformations can be applied to the initial HMM model
set M0 to form a new set of models M1. At this point, the procedure can be iterated. While the
first step of the next iteration would typically be aligning the adaptation data with the new model
set M1, we have found that we can obtain equally good recognition performance improvement by
only performing alignment each N-th iteration, where N is usually 3 or 4. Between alignment
iterations, only the EM process is performed. This saves additional computation, since the
alignment process does not need to be performed for each iteration.
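
The schedule described above (alignment only at every N-th iteration, EM alone in between) can be sketched as follows. The function name, the step labels, and the choice of six iterations are assumptions made for illustration; only the value N = 3 comes from the text.

```python
def adaptation_schedule(num_iterations, realign_every):
    """List the work done at each iteration: every `realign_every`-th
    iteration realigns before EM, the others run EM only."""
    steps = []
    for n in range(num_iterations):
        if n % realign_every == 0:
            steps.append("align+EM")
        else:
            steps.append("EM")
    return steps


# Six iterations with alignment every third iteration (N = 3).
schedule = adaptation_schedule(num_iterations=6, realign_every=3)
print(schedule)
```

Here only two of the six iterations pay the cost of an alignment pass.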

Referring to Fig. 1, in present-art HLR adaptation systems, either the successive sets of
HMM models M1, M2, etc., or the sets of transformations T1, T2, etc., must be stored to continue
iteration. Typically, since model sets are much larger than transformation sets in memory
storage requirements, it would be preferable to store the sets of transformations. This, of course,
requires dynamically calculating the new HMM model set by applying in succession each
transformation T1, T2, etc., greatly increasing the amount of computation required. This is
illustrated in Fig. 2, where it must be noted that each linear transform set also has a distinct
hierarchical mapping, since counts of monophones at each hierarchical tree node may be 
different. As a novel aspect of this invention, we describe below a method, illustrated in Fig. 3, 
whereby transformations can be merged at each iteration. This results in a large saving of 
computation and memory storage. It also provides flexibility, since only the initial HMM model 
set needs to be stored along with a single set of transforms, and any subset of the initial HMM 
model set can be adapted by the set of transforms for limited recognition tasks. 

The method of implementing HLR adaptation with merged transforms is now described
in detail.

Let $\Xi = \{\xi_1, \xi_2, \ldots, \xi_N\}$ be the set of nodes of the regression tree. Leaf nodes
$\Omega \subset \Xi$ of the tree correspond to a class which needs to be adapted. A class can be
either an HMM, a cluster of distributions, a state PDF, etc., depending on the adaptation scheme.
In the preferred embodiment, it is a monophone HMM. A leaf node $\alpha \in \Omega$ is assigned
the number $m(\alpha, n)$ of adaptation frame vectors associated with the node at iteration $n$
by the alignment of the adaptation speech data to the leaf node class. As mentioned previously,
Figure 4 shows part of a tree with leaves corresponding to monophone HMMs.
Define the function:

$$\phi : \Xi \mapsto \Xi$$

such that $\xi_j = \phi(\xi_k)$, $j \neq k$, is the root of the node $\xi_k$ (the node above $\xi_k$).
Similarly, we introduce the function

$$\varphi : \Xi \times [0,1] \mapsto \Xi$$

such that $\xi = \phi(\varphi(\xi, k))$, i.e. $\varphi(\xi, k)$ is the $k$-th descendent of the
node $\xi$.

At each iteration of parameter estimation, to each node is associated a number $\rho(\xi, n)$
recording the cumulative number of adaptation speech data input vectors under the node:

$$\rho(\xi, n) = \begin{cases} m(\xi, n) & \text{if } \xi \in \Omega \\ \sum_{k} \rho(\varphi(\xi, k), n) & \text{otherwise} \end{cases}$$



A node is called reliable if

$$\rho(\xi, n) \geq P \qquad (2)$$

where P is a constant, fixed for each alignment. The function

$$\psi : \Xi \times \mathbb{N} \mapsto \{\text{False}, \text{True}\}$$

is such that $\psi(\xi, n)$ indicates if a node is reliable at the $n$-th iteration. Note that at
each iteration, since the alignment between leaf nodes and speech signals may change, $\psi$ is a
function of $n$. Only reliable nodes are assigned a transform $T_{n,\xi}$. Each leaf node, which in
the preferred embodiment is an HMM, has its transform located on the first reliable node given
by recursively tracing back to the roots.

Introduce another function:

$$\chi : \Xi \times \mathbb{N} \mapsto \Xi$$

such that $\xi' = \chi(\xi, n)$ is the first root node of $\xi$ that satisfies
$\psi(\xi', n) = \text{True}$.

The invention uses a general form for the linear regression transforms, which applies a
linear transform T to a mean vector $\mu$ of a Gaussian distribution associated with an HMM state:

$$\hat{\mu} = T(\mu) = A\mu + B$$

where A is a D x D matrix, and B a D-dimensional
column vector. As a novel aspect of the
invention, at any iteration the current model corresponding to a leaf node $\alpha$ is always obtained
by transforming its initial model means. That is, the original model means are mapped to the
adapted model means at iteration $n$ as:

$$\hat{\mu}_{\alpha}^{n} = \hat{T}_{n,\xi}(\mu_{\alpha}^{0}) = \hat{A}_{n,\xi}\,\mu_{\alpha}^{0} + \hat{B}_{n,\xi}, \qquad \xi = \chi(\alpha, n)$$



The merging of transforms is now described in detail. Referring to Figure 1,
there can be distinguished two types of parameter estimation iterations: between EM
iterations and between alignment iterations. Each type of iteration requires a unique method
to merge transforms. The method of combination for each type is described below.

Merging transforms between EM estimations

Given

• The set of transforms that maps the initial models through $n-1$ EM
iterations, which we term a global transform set at $n-1$.

• The set of transformations that maps the models at the $(n-1)$-th iteration to the models at
iteration $n$ using EM estimation with no alignment, which we term a local
transform set at $n$.

The goal is to determine the resultant merged transform set that will be global at $n$, and will
combine the global at $n-1$ and local at $n$ transform sets.

It is important to note that between EM re-estimation iterations, no alignment of the 
adaptation speech data to the adapted models is performed, in order to save computation. Since 
no alignment is performed between the EM re-estimation iterations, the alignment is fixed, so the 
reliable node information is unchanged, and the association between nodes and transforms is 
fixed. That is, between the EM re-estimation iterations the functions $\rho$, $\psi$, and $\chi$ remain fixed.

Let $\hat{A}_{n-1,\xi}$ and $\hat{B}_{n-1,\xi}$ be the global transformation parameter set derived at
EM iteration $n-1$, and $A_{n,\xi}$ and $B_{n,\xi}$ be the local transformation parameter set derived
at EM iteration $n$. Then the single transformation set global at $n$ formed by merging is denoted
as $\hat{A}_{n,\xi}$ and $\hat{B}_{n,\xi}$, and is calculated for all $\xi$ such that $\psi(\xi, n)$ is
True as:

$$\hat{A}_{n,\xi} = A_{n,\xi}\,\hat{A}_{n-1,\xi}$$

$$\hat{B}_{n,\xi} = A_{n,\xi}\,\hat{B}_{n-1,\xi} + B_{n,\xi} \qquad (4)$$
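
The merge between EM estimations can be checked numerically. The following is a minimal sketch, not the patent's implementation: the 2-dimensional transforms, the initial mean, and all function names are made up for illustration. It folds each local transform into the running global one and compares the result with applying the local transforms one after another.

```python
def mat_vec(a, v):
    """Multiply matrix `a` (list of rows) by vector `v`."""
    return [sum(a[r][c] * v[c] for c in range(len(v))) for r in range(len(a))]


def mat_mul(a, b):
    """Multiply matrix `a` by matrix `b` (both lists of rows)."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def vec_add(u, v):
    return [x + y for x, y in zip(u, v)]


def apply_transform(transform, mu):
    """Map a mean vector: mu -> A mu + B."""
    a, b = transform
    return vec_add(mat_vec(a, mu), b)


def merge(global_prev, local):
    """Fold a local transform into the previous global one:
    A_hat_n = A_n A_hat_{n-1},  B_hat_n = A_n B_hat_{n-1} + B_n."""
    a_prev, b_prev = global_prev
    a_loc, b_loc = local
    return mat_mul(a_loc, a_prev), vec_add(mat_vec(a_loc, b_prev), b_loc)


# Made-up local transforms (A, B) for three EM iterations at one node.
local_transforms = [
    ([[1.0, 0.5], [0.0, 2.0]], [1.0, -1.0]),
    ([[0.5, 0.0], [1.0, 1.0]], [0.0, 2.0]),
    ([[2.0, 1.0], [0.0, 0.5]], [-3.0, 1.0]),
]
mu0 = [1.0, 2.0]  # made-up initial mean vector

merged = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # identity before any iteration
sequential = mu0
for t in local_transforms:
    merged = merge(merged, t)                    # one stored transform
    sequential = apply_transform(t, sequential)  # present-art chaining

print(apply_transform(merged, mu0), sequential)
```

Only `merged` has to be kept between iterations, and applying it once to the initial mean gives the same vector as the chained application.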



Let the above merging operation of transform sets be denoted as:

$$\hat{T}_{n,\xi} = \hat{T}_{n-1,\xi} \oplus T_{n,\xi}$$

Merging transforms between alignment iterations

Given

• The set of transforms that maps the initial models through $n-1$
iterations and using the $(i-1)$-th alignment, which is global for $n-1$ and $i-1$.

• The set of transformations that maps the models at iteration $n-1$ and the $(i-1)$-th
alignment to the models at the $i$-th alignment and iteration $n$, which is local at $n$.

• The set of reliable node information given by the functions $\rho$, $\psi$, and $\chi$ which is valid for
alignment $i-1$.

• The set of reliable node information given by the functions $\rho$, $\psi$, and $\chi$ which is valid for
alignment $i$.

The goal is to determine the set of accumulated transformations, global at $n$ and
$i$, which combines the global transform set at iteration $n-1$ and alignment $i-1$ and
the local transformation at iteration $n$ and alignment $i$.

In contrast to the accumulation between two EM iterations, the alignment here
may be changed, which results in a change in the reliable node information. Therefore the
association between nodes and transformations cannot be assumed fixed from the $(i-1)$-th to the $i$-th
alignment and from the $(n-1)$-th to the $n$-th iteration. The number of transformations at alignment $i$
may be different from that at $i-1$ for two reasons:

• The value of the constant P in Eq. 2 may change between alignments. Typically, P is
decreased to increase the number of transformations as the number of alignments $i$
increases.

• Even if P is kept constant, since the
HMM parameters are different at each alignment, the functions $\rho$, $\psi$, and $\chi$ may change as
a function of $i$.

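Then the merged global transformation set is given by:

$$\hat{T}_{n,\xi} = \begin{cases} \hat{T}_{n-1,\xi} \oplus T_{n,\xi} & \text{if } \psi(\xi, i-1) \wedge \psi(\xi, i) \\ \hat{T}_{n-1,\xi} \oplus T_{n,\chi(\xi,i)} & \text{if } \psi(\xi, i-1) \wedge \neg\psi(\xi, i) \\ \hat{T}_{n-1,\chi(\xi,i-1)} \oplus T_{n,\xi} & \text{if } \neg\psi(\xi, i-1) \wedge \psi(\xi, i) \\ \text{None} & \text{otherwise} \end{cases} \qquad (10)$$

where $\oplus$ denotes the merging operation of Eq. 4, with the transform on the left applied first.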

For every node $\xi \in \Xi$, exactly one of four situations can happen:

1. It is a reliable node at both alignments $i-1$ and $i$. The parameters of the models under this
node are therefore transformed by $\hat{T}_{n-1,\xi}$ and then by $T_{n,\xi}$.

2. It is a reliable node at alignment $i-1$ but not at alignment $i$. The transformation at $i$ is
therefore the one at the node $\chi(\xi, i)$. The parameters of the models under the node are therefore
transformed by $\hat{T}_{n-1,\xi}$ and then by $T_{n,\chi(\xi,i)}$.

3. It is a reliable node at alignment $i$ but not at alignment $i-1$. The transformation at $i-1$ is
therefore the one at the node $\chi(\xi, i-1)$. The parameters of the models under the node are therefore
transformed by $\hat{T}_{n-1,\chi(\xi,i-1)}$ and then by $T_{n,\xi}$.

4. It is not a reliable node at either alignment. The node therefore has no transformation, and
none is generated.
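
The four cases can be exercised on a toy tree. This is a sketch under stated assumptions, not the patent's code: the tree, the node names, and the scalar transforms x -> a*x + b are hypothetical, and it assumes the top node is reliable at both alignments so that a fallback transform always exists.

```python
def lookup(node, table, parent):
    """Return the transform stored at the first node at or above `node`."""
    while node not in table:
        node = parent[node]  # assumes the top node always has a transform
    return table[node]


def compose(prev, local):
    """Merge two scalar transforms: apply `prev` first, then `local`."""
    (a_prev, b_prev), (a_loc, b_loc) = prev, local
    return (a_loc * a_prev, a_loc * b_prev + b_loc)


def merge_across_alignment(nodes, parent, prev_global, local):
    """Build the merged global set from the global set at alignment i-1
    and the local set at alignment i, one case per node."""
    merged = {}
    for node in nodes:
        was, now = node in prev_global, node in local
        if was and now:    # reliable at both alignments
            merged[node] = compose(prev_global[node], local[node])
        elif was:          # reliable at i-1 only: use the local transform above
            merged[node] = compose(prev_global[node],
                                   lookup(parent[node], local, parent))
        elif now:          # reliable at i only: use the previous global above
            merged[node] = compose(lookup(parent[node], prev_global, parent),
                                   local[node])
        # reliable at neither alignment: no transform is generated
    return merged


def apply_transform(t, x):
    return t[0] * x + t[1]


# Hypothetical tree (child -> parent) and made-up transforms.
parent = {"l1": "mid", "l2": "mid", "mid": "root", "l3": "mid2", "mid2": "root"}
nodes = ["root", "mid", "mid2", "l1", "l2", "l3"]
prev_global = {"root": (1.0, 0.5), "mid": (2.0, -1.0), "mid2": (0.5, 2.0)}
local = {"root": (2.0, 1.0), "mid": (0.5, 0.0), "l1": (4.0, -2.0)}

merged = merge_across_alignment(nodes, parent, prev_global, local)

mu0 = 3.0  # made-up initial mean, shared by all leaves for simplicity
results = {}
for leaf in ["l1", "l2", "l3"]:
    chained = apply_transform(lookup(leaf, local, parent),
                              apply_transform(lookup(leaf, prev_global, parent), mu0))
    direct = apply_transform(lookup(leaf, merged, parent), mu0)
    results[leaf] = (chained, direct)
print(results)
```

For each leaf, the mean obtained from the single merged set matches the one obtained by chaining the two stored sets, which is the exactness property the method relies on.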



Referring to Figure 6, there is illustrated a system according to one embodiment of the
present invention wherein the input speech is compared to models at recognizer 60 wherein the
models 61 are HMM models that have been adapted using
only a single set of linear transforms. The single set of linear transforms
utilizes parameters wherein multiple EM iterations and multiple alignments to adaptation speech
data have been used to generate multiple sets of transforms, which are merged according to the
present invention to form the single set of linear transforms.
In the Claims

1. (currently amended) A method of hierarchical linear regression to develop a set of
linear transforms for adaptation of an initial set of Hidden Markov Models (HMM) models to a
new environment comprising the steps of:

providing an initial set of HMM models and
adaptation speech data from a new environment,

adapting the initial set of models to the new acoustic environment by a procedure
comprising the steps of creating an alignment of the adaptation speech data to the HMM model
set, then performing the iterative steps of Estimate-Maximize (EM) estimation to generate a local
set of linear transforms, merging the local set of linear transforms with a set of prior global
transforms to form a new global set of transforms, adapting the initial set of HMM models using
the new global set of transforms, and beginning a new EM estimation iteration step to repeat the
procedure.
Claim 2 (canceled) 
Claim 3 (canceled) 
Claim 4 (canceled). 

5. (new) The method of Claim 1 wherein after a number of EM estimation iteration steps
the steps of realigning the adaptation speech data with the adapted HMM models wherein
parameters can be adjusted to expand the set of linear transforms, performing an EM estimation
step to generate a new set of local transforms, combining the new local transforms with the prior set
of global transforms to form a new set of global transforms in accordance with the new
alignment, and further performing iterative steps of EM estimation.
Abstract

A new iterative hierarchical linear regression method for generating a set of linear 
transforms to adapt HMM speech models to a new environment for improved speech recognition 
is disclosed. The method determines a new set of linear transforms at an iterative step by 
Estimate-Maximize (EM) estimation, and then combines the new set of linear transforms with 
the prior set of linear transforms to form a new merged set of linear transforms. An iterative step 
may include realignment of adaptation speech data to the adapted HMM models to further 
improve speech recognition performance.